layout: post title: Exploration in Modeling with SF 311 subtitle: Trying to predict cover-img: /assets/img/rainbow.jpg

tags: [SF311, opendata, public, bay area, quality of life, modeling,predictions]

In [56]:
import pandas as pd
import numpy as np
import math 
import pprint
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
# import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)
matplotlib.style.use('ggplot')
pp = pprint.PrettyPrinter(indent=4)
pd.options.display.max_columns = 100
pd.set_option('display.max_rows', 100)
df = pd.read_csv('reduced1.csv', header=0, error_bad_lines=False, engine="python", warn_bad_lines=False);

Dataset Introduction: San Francisco 311 Cases

Provided by DataSF, this set is a monstrosity containing about 4.25 Million rows and counting. For those not familiar, 311 is a general customer service number for city government, most commonly associated with non-emergency complaints. 311 cases can be created via phone, mobile, and web. The dataset covers the time period from July 2008 until present.

In order to do data exploration and analysis of this dataset we needed to make a working sample to reduce memory and cpu usage upfront:

awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01 || FNR==1) print $0}' filename

Further information about the dataset can be found here.

One of the fascinating aspects of city living is human behavior.

One behavior every human seems to enjoy is complaining, in some shape way or form. Whether its done in exercise of right to recourse, for entertainment, or out of civic duty, we encounter a situtation that make us uncomfortable... in SF, we file a 311 case on it. You can even do it via Twitter! I wanted to focus on the subset of cases that reflect the spirit of 'complaints--' in particular those concerned with the behavior of others-- and generally require some sort of short term physical response.

Sadly, it seems there aren't many cases filed to commend folks for good behavior, so we will be looking at mostly negative or unpleasant situations here. Accordingly, we have attempted to exclude cases concerning general administrative requests, such as building permits, tree maintenance, and the like. In addition, despite it being filled with negative comments, I also chose to exclude the muni category, insofar as the Muni (city bus & train operators) is its own organization with its own culture, that I don't care to upset by pointout the exceedingly high volume of complaints.

From my personal observation, corroborated by many of my peers, once the 311 case is filed it goes into a black box, and we can only hope to guess at if or when the matter will be addressed. This can be very frustrating for the complainant, and in turn likely results in corresponding frustration for the people inside the black box, who receive many repeated complaints each day .

If only there were a way to predict how long it would take to resolve each issue...

Well, luckily, there are ways, in particular statistical learning models, and we shall see what fruit they may bear.

Datacleaning highlights:

  • encode, convert or cast every single feature
  • extreme values
  • malformed data
  • datetime index
  • 4.2M rows
In [57]:
## experiment in oop
## instantiate  a wrangler and setup drops and encoding. 
## run methods on the internal working copy to iterate 
## 

class Wrangler:        
    """the intent is for each instance of wrangler to contain lists of 
    specific transformations to be performed on an dataframe
    by the class methods. """
    #class vars
    dropcols = list()    #list to store columns to drop
    droprows = dict()    #row boolean filters

    encoders = dict()    # colname, function to be applied to elements
    
    dropcols_post = list()  #keep a distinct list to drop after enconding

    def calc_lt_005_cat(self, df): 
        """ return a list of the bottom .5% category names for binning"""
        return df.Category.value_counts()[df.Category.value_counts(normalize=True).values < .005]

    def __init__(self, data):
        self.raw_df = data.copy()         #preserve the original dataframe
        self.working = self.raw_df.copy() #this will be passed around or be used as default by class methods
        
    #methods
    def set_data(self,newdf):
        """reset dataframe to new df, leaving configuration intact"""
        print('set data to df', df.shape)
        self.raw_df= newdf.copy()
        self.working = self.raw_df.copy()
        return self.working
    
    def add_to_dropcols(self,labels):
        for l in labels:
            if l in self.dropcols:
                print('error column already in dropcols list')
                break
                return self.dropcols
        self.dropcols.extend(labels)
        return self.dropcols

    def get_dropcols(self):         #why use accessor methods in python? 
        
        if not self.dropcols: 
            print('no columns to drop')
        return self.dropcols
    
    def clear_dropcols(self):
        self.dropcols=list()
        return True
    def get_nancols(self, df=None, store=False):
        if df is None: 
            df=self.working
        cols =df.columns
        nc=df[cols].isnull().sum().index.tolist()
        if store:
            self.nancols = nc
        return nc
    def add_to_droprows(self, name, expr):
        self.droprows[name] = expr 
        return self.droprows.keys()

   
    def add_encoding(self, label, mapper):
        try:
            label in self.working.columns
        except:
            print(label, 'not found in working copy, may have been dropped')
            assert label in self.raw_df.columns
        self.encoders[label]= mapper
    ###TODO: 
    def get_params(self):
        pass
        #return all wrangler settings 
    def set_nanval(self):   
        pass 
       # self.nanvals add self to 

#"internal"
    def drop_rows_by_mask(self,df=None,rowmask=None):
        """takes a dataframe, a boolean row rowmask to drop INPLACE,
        or drop all the rows in self.droprows,
        returns the modified dataframe"""
        if df is None:
            df=self.working
        if rowmask is None:
            print('dropping all in droprows..')
            if self.droprows:
                dropdict = list(self.droprows.keys())
            rowmask = self.droprows[dropdict[0]]  #get first boolean mask
            for l in dropdict: 
                print('\napplying mask: ',l )
                rowmask = rowmask | self.droprows[l]   #or them all together
        df.drop(index=df[rowmask].index,inplace=True)
        return df 
    
    def drop_dupes(self, df=None):
        """INPLACE: drop duplicate rows in df,
        return modified copy"""
        if df is None:
            df=self.working
        todropindex = df[df.duplicated()].index
        print('\n dropping',todropindex.shape, 'rows')
        df.drop(todropindex,axis=0, inplace=True)
        return df

    def drop_columns(self,df=None, list=None):
        """INPLACE if df provided, use working
        if no list is provided, drop all columns in list
        if list is provided, drop only the columns in argument list
        and add them to the dropcols list
        return modified dataframe"""
        
        if df is None:
            df = self.working  
        if list:                          #list of droplabels is passed, added to self.dropcols,then dropped.
            for l in list:
                if l in self.dropcols:
                    print('error column already in dropcols list')
                    break
                    return self.dropcols
            self.dropcols.extend(list)
        else:
            list= self.dropcols
                                   #TODO: check dropcols present in df.columns
        return df.drop(labels=list, axis=1, inplace=True)
    
    def drop_columns_post(self):
        """INPLACE drop any columsn in list dropcols_post 
        dont return anything"""
        drop_columns(list=dropcols_post)
        
    def bin_othercats(self):
        """to help reduce cardinality bin the lower .5pct of categories as other
            CAUTION need to do this after dropping bad data rows, it will change every time
            the df is resampled.... """
        df = self.working
        othercats = self.calc_lt_005_cat(df).index.tolist()
        df['Category'] = df.Category.map(lambda cat : 'other' if cat in othercats else cat)
        
    def encode(self,df=None, label=None, fun=None):
        """label is key for dict AND is column label 
        fun is a function for pd.Series.map() 
        """
        if df is None: 
            df=self.working
            print('\n Encoding, changing working copy..')
        if fun is None:
            fun = list(self.encoders.keys()) 
            for k in fun:
                print('  ... encoding column: ',k)
                df[k] = df[k].map(self.encoders[k])
        else:
            df[label] = df[label].map(fun)
        return df
    
    def to_datetime(self, cols=None, df=None):
        if df is None:
            df=self.working
            print('\nworking df is being changed..')
        if cols is None:
            cols=self.dt_cols
        for c in cols:
            print('converting',c,'to datetime')
            try:
                df[c] = pd.to_datetime(df[c],infer_datetime_format=True)  #inplace 
            except: 
                print('error - possible this column needs cleaning')
        return df

       
    def make_feature(self,newlabel,input,fun):      #make or overwrite column newlabel
        """input=label, or list of labels that form series to apply fun"""
        df=self.working
        try:
            df[newlabel] = df[input].apply(fun, axis=1)  #or map or tranform?
            print('\nadded feature:', newlabel)
        except:
            print('there was a problem, not added!!')
            return False
        return True
    def calc_open_cases(self, sometime):  #input time
        df= self.working[['CaseID','Opened','Closed']]
        opened_prior = df['Opened'] < sometime        # cases opened before it,
        not_closed_prior = ~(df['Closed'] < sometime) # not closed, 
        open_at_thattime = opened_prior & not_closed_prior  #and 
        return open_at_thattime.sum()
    
    order_default = [drop_rows_by_mask,
                    drop_columns, 
                    drop_dupes,
                    encode,
                    drop_columns_post]   #list of methods in order of application

    def wrangle(self, df=None):

        if df is None:
            df =self.raw_df.copy() # start from the beginning 
            result = df
            print('will apply :', self.order_default)
            for f in self.order_default:
                print('level')
                result = f(result)
        return result
    
In [58]:
##column lists after exploration and research

delete = [s for s in df.columns.values if 'DELETE - ' in s]   #columns have delete in name, the appear to be older version of data
boundary = [s for s in df.columns.values if 'Boundary' in s]   #columns have boundary in name, dont need, redundant with neighborhood
floatcols = df.select_dtypes(include='float64').columns

##define settings here 
wrangler = Wrangler(df)   
wrangler.clear_dropcols()

wrangler.add_to_dropcols(delete)
wrangler.add_to_dropcols(['Point'])  ##redundant w/ lat long, but may use for geospatial later
wrangler.add_to_dropcols(['Parks Alliance CPSI (27+TL sites)','Supervisor District'])
wrangler.add_to_dropcols(['Central Market/Tenderloin Boundary Polygon - Updated'])
wrangler.add_to_dropcols(['Civic Center Harm Reduction Project Boundary',
       'Fix It Zones as of 2017-11-06 ', 'Invest In Neighborhoods (IIN) Areas',
       'Fix It Zones as of 2018-02-07','SF Find Neighborhoods',
       'CBD, BID and GBD Boundaries as of 2017','Current Supervisor Districts',
       'Central Market/Tenderloin Boundary', 'Areas of Vulnerability, 2016',
       'HSOC Zones as of 2018-06-05', 'OWED Public Spaces'])

missing_to_0 = lambda x : 0 if math.isnan(x) else int(x)   #convert float to int, nan to zeros
missing_to_unk =  lambda x : x if type(x) == str else 'missing'  #for strings 

wrangler.add_encoding('Request Details', missing_to_unk)
wrangler.add_encoding('Neighborhood', missing_to_unk)
wrangler.add_encoding('Police District', missing_to_unk)

wrangler.add_encoding('Analysis Neighborhoods', missing_to_0)  #convert neigborhoods
wrangler.add_encoding('Neighborhoods', missing_to_0) 
wrangler.add_encoding('Street', missing_to_unk )
wrangler.add_encoding('Media URL', lambda x : 'present' if type(x) == str else 'missing')
In [59]:
### special alert, some of our nasty data yields 

def caseid_to_int (x):    #convert good caseids to int, bad to zero for drop
    try:
        i = int(x)
    except:
        i = 0   
        return 0      #strings are converted to zero for dropping 
    return i

wrangler.droprows['drop_missing_caseid'] = df.CaseID.map(caseid_to_int) == 0    #mask to drop rows with no caseid
wrangler.add_encoding('CaseID', caseid_to_int)        #can be done any order
In [60]:
## Rows to drop 

nopen = df.Opened.isnull()    # bad records null, open date

wrangler.droprows['drop null open date'] = nopen #migh be fixed by  missing caseID?

nclosed= df.Closed.isnull()   # some are really still open this doenst need to be dropped.



excluded_cats =['SFHA Requests',     #these are adminstrative, e.g temp signs, or belong to other departments and are not centered around human behavior
                'Sidewalk or Curb',
                'Temporary Sign Request',
                'Tree Maintenance',
                'Rec and Park Requests']
#more infrastructure generic requests 
gen_req =['General Request - PUBLIC WORKS',    # is street cleaning in here?

        'Sewer Issues', 'Streetlights',
        'Sign Repair', 
        'General Request - PUC',
        'General Request - COUNTY CLERK',
        'General Request - 311CUSTOMERSERVICECENTER',               
        'General Request - MTA']
#
exclude_gen = df.Category.isin(gen_req)
exclude_svc = df.Category.isin(excluded_cats)  
muni = df.Category== 'MUNI Feedback'   #bus complaints we dont want, DPW doesnt service 

wrangler.add_to_droprows('muni', muni);
wrangler.add_to_droprows('svc_req', exclude_svc);
wrangler.add_to_droprows('gen_req',exclude_gen);

MUNI - A short Detour

Here are the top cases filed against the city's public tranportation operation:

In [61]:
(df[muni]['Request Type'] == 'MUNI - Commendation').mean()  ## at least 3% of these are positive. 
df[muni]['Request Type'].value_counts(ascending=False,normalize=True)   #LOL DROP
Out[61]:
MUNI - Conduct_Inattentiveness_Negligence                         0.240359
MUNI - Services_Service_Delivery_Facilities                       0.160987
MUNI - Conduct_Discourteous_Insensitive_Inappropriate_Conduct     0.112108
MUNI - Conduct_Unsafe_Operation                                   0.078027
MUNI - Services_Miscellaneous                                     0.055157
MUNI  - Services_Service_Delivery_Facilities                      0.051121
MUNI - Services_Service_Planning                                  0.046188
MUNI  -                                                           0.045291
MUNI  - Conduct_Discourteous_Insensitive_Inappropriate_Conduct    0.039462
MUNI - Commendation                                               0.039013
MUNI  - Conduct_Inattentiveness_Negligence                        0.035426
MUNI  - Services_Miscellaneous                                    0.031839
MUNI  - Conduct_Unsafe_Operation                                  0.024664
MUNI  - Services_Service_Planning                                 0.021525
MUNI - Services_Criminal_Activity                                 0.016592
MUNI  - Services_Criminal_Activity                                0.001794
SSP SFMTA Feedback                                                0.000448
Name: Request Type, dtype: float64

So.... I was wrong about it being ALL BAD...

Fully 3% of the MUNI cases are commendations for MUNI employees. Sounds about right. I would draw your attention to the subject of the remaining cases but I promised not to bash them...

We want to predict how long it will take for our ticket to be resolved but that feature didnt't exist.

So it was created by finding the time difference between the Opened and Closed timestamp. We will call this the Time To Resolution or *ttr for short:

  wrangler.make_feature('ttr',['Opened','Closed'],lambda x : x[1]-x[0])

In order to extract more predictive power from the past data, we create another feature, called workload, which is simply the number of cases that were open at the time of case creation.

def calc_open_cases(sometime):  #input time
    df= wrangler.working[['CaseID','Opened','Closed']]
    opened_prior = df['Opened'] < sometime        # cases opened before it,
    not_closed_prior = ~(df['Closed'] < sometime) # not closed, 
    open_at_thattime = opened_prior & not_closed_prior  #and 
    return open_at_thattime.sum()

Now lets look more at these and our features:

In [62]:
## these cols contain our timestamps as strings. we weill convert them to datetime object type for (in)convenience

dt_cols = ['Opened','Closed','Updated']
wrangler.dt_cols = dt_cols

## run  some of the setup functionality 
wrangler.drop_rows_by_mask()
wrangler.drop_dupes()
wrangler.to_datetime();
wrangler.bin_othercats()   
wrangler.encode();
wrangler.drop_columns(list=wrangler.dropcols_post)
wrangler.make_feature('ttr',['Opened','Closed'],lambda x : x[1]-x[0])
wrangler.working.Latitude = wrangler.working.Latitude.astype(float)    # :(
wrangler.working.Longitude = wrangler.working.Longitude.astype(float)    # :(
wrangler.working['case_year'] = pd.DatetimeIndex(wrangler.working.Opened).year
wrangler.working['case_month'] = pd.DatetimeIndex(wrangler.working.Opened).month

#add feature open cases, number of cases open at the time current ticket is open
#not super happy with this, since we have dropped many rows from consideration, so this statistic should be
#interpreted carefully

wrangler.working['workload'] = wrangler.working['Opened'].apply(wrangler.calc_open_cases)  ##done after e
#wrangler.make_feature('workload', ['Opened'],wrangler.calc_open_cases)
dropping all in droprows..

applying mask:  drop_missing_caseid

applying mask:  drop null open date

applying mask:  muni

applying mask:  svc_req

applying mask:  gen_req

 dropping (0,) rows

working df is being changed..
converting Opened to datetime
converting Closed to datetime
converting Updated to datetime

 Encoding, changing working copy..
  ... encoding column:  Request Details
  ... encoding column:  Neighborhood
  ... encoding column:  Police District
  ... encoding column:  Analysis Neighborhoods
  ... encoding column:  Neighborhoods
  ... encoding column:  Street
  ... encoding column:  Media URL
  ... encoding column:  CaseID

added feature: ttr
In [63]:
wrangler.make_feature('ttr',['Opened','Closed'],lambda x : x[1]-x[0])

def calc_open_cases(sometime):  #input time
    df= wrangler.working[['CaseID','Opened','Closed']]
    opened_prior = df['Opened'] < sometime        # cases opened before it,
    not_closed_prior = ~(df['Closed'] < sometime) # not closed, 
    open_at_thattime = opened_prior & not_closed_prior  #and 
    return open_at_thattime.sum()
    
###
##look at negative deltas, ddecide to drop 
df = wrangler.working 
neg_ttr = df.ttr.map(lambda td : td.total_seconds() < 0)
df[neg_ttr]['CaseID'].nunique() ,df[neg_ttr].shape       #one record per caseid here
#drop them, they look like data import errors and are clustered in specific years 
wrangler.drop_rows_by_mask(df=None, rowmask=neg_ttr)
df[neg_ttr].shape     ## verify dropped neg ttr
added feature: ttr
/home/dliu/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:17: UserWarning:

Boolean Series key will be reindexed to match DataFrame index.

Out[63]:
(0, 25)

DATA Exploration:

Daily Case Statistics

  • upward trend in new cases, workload
  • non-monotonic increase
  • extreme values in TTR
In [64]:
from matplotlib.ticker import (MultipleLocator, FormatStrFormatter,
                               AutoMinorLocator)
import matplotlib.dates as mdates

daily = wrangler.working.copy()

fig, ax= plt.subplots(3)
plt.subplots_adjust(left=None, bottom=0.1, right=None, top=None, wspace=None, hspace=0.5)

daily['date'] = pd.DatetimeIndex(daily.Opened).date
daily[['date','workload']].groupby('date').max().plot(legend=False, figsize=(20,8), ax=ax[1])
#daily_load.
daily[['date','CaseID']].groupby('date').count().plot(legend=False ,ax=ax[0])
daily[['date','ttr']].groupby('date').agg({'ttr': lambda t : np.mean(t.dt.total_seconds()/3600 )}).plot(ax=ax[2])

ax[0].set_title('Total new cases per day')
ax[1].set_title('Peak daily workload: (Max #  of open cases per day)')
ax[2].set_title('Average Time to Resolve (ttr) for the days tickets')

plt.show()

Monthly Case Statistics

  • use groupby() and agg()
  • mean ttr possible artifact created by dropping open cases the end of the dataset
  • NOTE: ttr is 'looking ahead' contains information about future state
In [65]:
wrangler.working.groupby(by=['case_year','case_month','Request Type'])[['CaseID','ttr']].agg({'CaseID': 'count'})
#produce stats by month and year
monthly_counts_by_request_type = wrangler.working.groupby(by=['case_year',
                                                              'case_month',
                                                              'Request Type']).agg({'CaseID': 'count', 
                                                                                    'workload': np.mean,
                                                                                    'ttr': lambda x : np.mean(x)})
monthly_counts_by_request_type.columns = ['case_count', 'avg_opencases', 'avg_ttr']
monthly_counts_by_request_type.sort_index(inplace=True)

monthly_counts = wrangler.working.groupby(by=['case_year','case_month']).agg({'CaseID': 'count', 'workload': np.mean, 'ttr': lambda x : np.mean(x)})
monthly_counts.columns= columns = ['case_count', 'avg_opencases', 'avg_ttr']
In [66]:
fig,ax = plt.subplots(1,3, figsize=(27,6))
m= np.mean(monthly_counts.avg_ttr).days
s= monthly_counts.avg_ttr.std().days


monthly_counts.columns =['case_count', 'avg_opencases', 'avg_ttr']  
monthly_counts.case_count.plot(ax=ax[0], title = 'New Cases by Month')
monthly_counts.avg_opencases.plot(ax=ax[1], title = 'Average # Open Cases at ticket creation')
monthly_counts.avg_ttr.plot(ax=ax[2], title= 'Monthly Mean Time to Resolution')
ax[2].text(.8,1000, f"ttr mean: {m}\nttr std: {s}", fontsize =15)
plt.show()

How can we model the ttr and predict the resolution time of a case opened?

Since the variance of the target is so high, and thinking about the wide range of request types being serviced, it seems reasonable to divide response time into a few classes. Doing so we may discover that certain types of requests tend to be resolved faster in general, or that there are other presently unknown factors that influence how soon our complaints are attended to..

We can explore the boundaries for the classes.

Everyone likes a low, medium, high system. We also preview the effect of dropping extreme values.

In [67]:
matplotlib.style.use('seaborn')

#explore ttr class distribution in the wrangled data
df= wrangler.working.copy()
df.ttr = df.ttr.dt.total_seconds()/3600

min_value =0
max_value =100
l_m_bound = 4
m_h_bound = 40
xb = 300      #low pass filter  


a = widgets.FloatRangeSlider(min=df.ttr.min(),  max=150,value=[l_m_bound,m_h_bound],
                             description='Bounds', readout_format='.1f',readout=False)
b = widgets.IntSlider(         #adjust extreme value cutoff
    value=xb,
    min=m_h_bound,max=xb,  
    description='extreme cutoff'
)
ui = widgets.HBox([a,b])
def f(a,b):
    fig, ax1 = plt.subplots(1,figsize =(15,9))
    lb,ub = a   #bounds for middle class
    xb=b
    def mapttrs(ttr):
        low  = ttr < lb   
        med  = lb <= ttr <= ub
        high = ub < ttr < xb
        ext  = ttr >= xb
        if low:
            return 'low'
        elif med:
            return 'med'
        elif high:
            return 'high'
        else:
            return 'xb'

    valct = df.ttr.apply(mapttrs).value_counts(normalize=True)
    xbval = (f"\nExtreme Values cutoff= {b}")
    medval= (f"bounds for med (hrs): {a}")   
    
    hcut = df.ttr > xb
    wrangler.drop_rows_by_mask(df=df,rowmask=hcut)     #should toggle this on and off! 
    df.ttr.plot(kind='hist', legend=True,ax=ax1, bins=100, alpha=.8)
    
    l,r = a
    mp = (l+r)/2    #midpoint :\
    ax1.axvline(x=l, color='blue', ls='--')
    ax1.axvline(x=r, color='blue', ls='--')
    ax1.text(1.1*r,1500,f"Class Sizes:\n{medval}\n{valct}\n{xbval}", fontsize=14) 
    ax1.text( mp,2000,f"med", ha='center',fontsize=14) 
    ax1.set_xlim(0,xb)
    

    plt.show()

out = widgets.interactive_output(f, {'a': a, 'b': b})
display(ui, out)

Assign the classes according to (my) common sense heuristic:

'low' ttr = 4 hr or less 'med' ttr = between 4 and 40 hours 'high'ttr = anything above 40 hrs

I loathe to drop rows if they dont represent bad data, and there is justification for some of the long ttrs.

df= wrangler.working.copy()
df.ttr = df.ttr.dt.total_seconds()/3600
df[df.ttr > xb].Category.value_counts().head(6)

  • Graffiti 1873
  • Street and Sidewalk Cleaning 610
  • Other 370
  • Damaged Property 348
  • Encampments 232
  • Street Defects 163

So we will add the target classes to as above with out dropping, yielding the following balance:

high    0.450375
med     0.285966
low     0.263659
In [68]:
## keep this, to setup target ttr_class on wrangler
df= wrangler.working.copy()
df.ttr = df.ttr.dt.total_seconds()/3600
df[df.ttr > xb].Category.value_counts().head(6)


def apply_ttr_class(ttr):
    """apply to wrangler, ttr needs conversion"""
    lb,ub = 4,40
    low  = ttr < lb   
    med  = lb <= ttr <= ub
    high = ub < ttr < xb
    
    if low:
        return 'low'
    elif med:
        return 'med'
    else:
        return 'high'
        
wrangler.working['ttr_class']= wrangler.working.ttr.dt.total_seconds()/3600
wrangler.working.ttr_class = wrangler.working['ttr_class'].apply(apply_ttr_class)
wrangler.working.ttr_class
Out[68]:
3        high
4         med
5        high
7         med
8        high
         ... 
41677     med
41679     low
41680     low
41681     med
41682     low
Name: ttr_class, Length: 30035, dtype: object

Time to predict the future....

We have simplifed our problem to predicting whether the case we open today will be resolved in one of three time intervals, making this a classfication problem. Our first approach will be to fit a logistic regression, followed by some tree based classifiers. At this point we'll drop CaseID, Status, Notes, and any TimeStamped columns that can either uniquely identify the case, or contain information that was not known at the time of creation. We can retain some features like hour of the day, day of week.

features = df.columns.drop(['CaseID','Opened','Closed','Updated','Status',
'Status Notes','ttr','workload','ttr_class']) #drop time or status related columns
In [69]:
##setup for classification 
df=wrangler.working.copy()
lrfeatures = df.columns.drop(['CaseID','Opened','Closed','Updated','Status','Status Notes','ttr','Current Police Districts', 'workload','ttr_class']) #drop time or status related columns
features= lrfeatures
target='ttr_class'
features,target
Out[69]:
(Index(['Responsible Agency', 'Category', 'Request Type', 'Request Details',
        'Address', 'Street', 'Neighborhood', 'Police District', 'Latitude',
        'Longitude', 'Source', 'Media URL', 'Analysis Neighborhoods',
        'Neighborhoods', 'case_year', 'case_month'],
       dtype='object'), 'ttr_class')
In [ ]:
 
In [70]:
from sklearn.linear_model import LogisticRegression
import category_encoders as ce
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from sklearn.metrics import roc_auc_score, roc_curve, recall_score

target='ttr_class'

train,test = train_test_split(df, train_size=.8, random_state=99)
train,val  = train_test_split(train, train_size=.8, random_state=99)

print(train.shape, val.shape)

X_train = train[features]
y_train = train[target]
X_val   = val[features]
y_val   = val[target]

X_test = test[features]
y_test = test[target]

logreg = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(),
    StandardScaler(),
    SelectKBest(k=10),
    LogisticRegression()
)
logreg.fit(X_train,y_train);
y_val_pred= logreg.predict(X_val)
y_pred_proba=logreg.predict_proba(X_val); #need for ROC
y_test_pred= logreg.predict(X_test)

accuracy_score(y_val,y_val_pred) ,accuracy_score(y_test,y_test_pred) 
(19222, 26) (4806, 26)
Out[70]:
(0.4929255097794424, 0.4899284168470118)

... and survey says, LogReg not so hot.

With validation and test accuracy of (0.4929255097794424, 0.490427834193441), this certainly sets a baseline for further model evaluation.

In [71]:
import category_encoders as ce
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.utils.multiclass import unique_labels
from sklearn.metrics import confusion_matrix

import eli5
from eli5.sklearn  import PermutationImportance

baseline= df[target].value_counts(normalize=True).values[0] #baseline

k=10

models = [DecisionTreeClassifier(),
          RandomForestClassifier(max_depth=10),
          XGBClassifier(max_depth=5, gamma=2)  ]
    
for model in models:
    pipe = pipe = make_pipeline(ce.OrdinalEncoder(),SimpleImputer(),StandardScaler(),SelectKBest(k=k),model)
    pipe.fit(X_train,y_train)
    print(list(pipe.named_steps.keys())[-1])
    print('Training Accuracy:', pipe.score(X_train, y_train))
    print('Validation Accuracy:', pipe.score(X_val, y_val))
    print('Test Accuracy:', pipe.score(X_test, y_test))
    print('baseline Accuracy:', baseline,'\n')
    print(classification_report(y_test, pipe.predict(X_test)))
decisiontreeclassifier
Training Accuracy: 0.9958901258974092
Validation Accuracy: 0.5624219725343321
Test Accuracy: 0.5568503412685201
baseline Accuracy: 0.45037456300982187 

              precision    recall  f1-score   support

        high       0.68      0.69      0.69      2633
         low       0.55      0.54      0.54      1590
         med       0.38      0.38      0.38      1784

    accuracy                           0.56      6007
   macro avg       0.54      0.54      0.54      6007
weighted avg       0.56      0.56      0.56      6007

randomforestclassifier
Training Accuracy: 0.7205805847466444
Validation Accuracy: 0.643986683312526
Test Accuracy: 0.6435824870983852
baseline Accuracy: 0.45037456300982187 

              precision    recall  f1-score   support

        high       0.67      0.91      0.77      2633
         low       0.65      0.68      0.67      1590
         med       0.51      0.22      0.30      1784

    accuracy                           0.64      6007
   macro avg       0.61      0.60      0.58      6007
weighted avg       0.62      0.64      0.60      6007

xgbclassifier
Training Accuracy: 0.6871293309749246
Validation Accuracy: 0.6437786100707449
Test Accuracy: 0.6445813217912435
baseline Accuracy: 0.45037456300982187 

              precision    recall  f1-score   support

        high       0.67      0.91      0.77      2633
         low       0.64      0.69      0.66      1590
         med       0.52      0.22      0.30      1784

    accuracy                           0.64      6007
   macro avg       0.61      0.60      0.58      6007
weighted avg       0.62      0.64      0.60      6007

In [72]:
y_val_pred = pipe.predict(X_val)

confusion = confusion_matrix(y_val, y_val_pred,  labels=None, sample_weight=None, normalize=None)
print(list(pipe.named_steps.keys())[-1])
print(confusion)

unique_labels(y_test)

def plot_confusion_matrix(y_true, y_pred):
    labels = unique_labels(y_true)
    columns = [f'Predicted {label}' for label in labels]
    index = [f'Actual {label}' for label in labels]
    table = pd.DataFrame(confusion_matrix(y_true, y_pred), 
                         columns=columns, index=index)
    return sns.heatmap(table, annot=True, fmt='d', cmap='viridis')

plot_confusion_matrix(y_val, y_val_pred);
xgbclassifier
[[1921   58  139]
 [ 258  888  155]
 [ 680  422  285]]
In [73]:
 
#perm = PermutationImportance(model, random_state=887).fit(X_val_enc,y_val)
#eli5.show_weights(perm, feature_names = X_val_enc.columns.tolist())

feature_names = pipe.named_steps['ordinalencoder'].get_feature_names()
feature_names
Out[73]:
['Responsible Agency',
 'Category',
 'Request Type',
 'Request Details',
 'Address',
 'Street',
 'Neighborhood',
 'Police District',
 'Latitude',
 'Longitude',
 'Source',
 'Media URL',
 'Analysis Neighborhoods',
 'Neighborhoods',
 'case_year',
 'case_month']

Linear regression lite, Plotly Express with OLS trendline :

Without doing much more than a simple scatter plot of workload, we can visualize a regression line that captures the general upward trend inside of the undulation.

In [74]:
import plotly.express as px
from datetime import datetime
daily = wrangler.working.copy()
daily['date'] = pd.DatetimeIndex(daily.Opened).date

daily_load=daily[['date','workload']].groupby('date').max()
df_dl =daily_load.groupby('date').max()

df_dl['day_offset'] = (df_dl.index - df_dl.index[0]).days  #get an int represention of days since the first value


#need to create label list since we have to serialize the dates as ints for px to do the least squares
datelist = pd.date_range(datetime.today(), periods=df_dl.shape[0]).tolist()  #

fig = px.scatter(df_dl,x='day_offset', y='workload',trendline='ols')


fig.update_xaxes(tickangle=45,
                 tickmode = 'array',
                 tickvals = df_dl['day_offset'][0::40],
                 ticktext= [d.strftime('%Y-%m-%d') for d in datelist])

fig.show()
In [75]:
#lets see what happens if we just treat the workload to a linear regression 

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

train, test = train_test_split(df_dl, test_size=.25)     

X_train = train.day_offset.values.reshape(-1,1)
y_train = train.workload.values.reshape(-1,1)
X_test = test.day_offset.values.reshape(-1,1)
y_test = test.workload.values.reshape(-1,1)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(model.coef_, model.intercept_)      #same slope calculated by px above
print(model.score(y_pred, y_test), mean_absolute_error(y_pred,y_test))  #clearly regression leaves something to be desired.
[[0.09857098]] [-11.25573084]
-2.044872940246718 43.194449250712374

Linear Regression Model Coeffs and Error:

/The sklearn linear model yields a slope that is the same as that calculated by plotly express above: 0.098 Note that the intercept is slightly shifted- but we are using a random sample here vs the enter dataframe above. We use both the model's .score() method and the mean_absolute_error package to estimate how well the linear model fits the data. The negative R2 value of -2.09 is telling us that our model does no better than a mean baseline. Test MAE is 42.26893511094306/

df_dl.workload.mean(), df_dl.workload.std() = (207.0688439849624, 134.49243473471824)

Will a tree model perform better?

Linear regression isnt really working well for this data. Shall we see if a tree model can better handle the non-monotonic change in workload? Using the same train / test split from above, we have to shape the arrays differently before feeding them to the tree ensemble.

In [97]:
from sklearn.ensemble import RandomForestRegressor
from IPython.display import display
from ipywidgets import interact, interactive, fixed, interact_manual, IntRangeSlider, IntSlider
import ipywidgets as widgets


def f(md, xlims):
    model = RandomForestRegressor(max_depth=md)

    X_train = train.day_offset.values.reshape(-1, 1) 
    y_train = train.workload.values
    X_test = test.day_offset.values.reshape(-1, 1) 
    y_test = test.workload.values

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    print('Mean Abs Err : ', mean_absolute_error(y_pred,y_test)) 
    
    fig,ax = plt.subplots()
    rfpred= pd.DataFrame(data=[y_test,y_pred]).T
    
    ax= rfpred[0].plot( figsize=(30,6), color='red')           # true values
    ax= rfpred[1].plot(                color='blue', alpha=.7) # predicted values
    ax.legend (['actual', 'predicted'])
    
    ax.set_xlim(xlims)
    plt.title('RF Model Predicted vs Actual \n Daily Workload: (Max Queue length)')
    plt.show()

xlimr= 1200

xlims =  IntRangeSlider(description='x window',
                        value=[0,xlimr],
                        min=   0,
                        max=   xlimr,
                        step = 50,
                        disabled=False,
                       readout=True)

md =  IntSlider(description='max_depth',
                min=1,
                max=10,
                step=1, 
                value=3)

interact(f, md=md,xlims=xlims)

xlimr
Out[97]:
1200

Got trees? Lets fit some more models and try to see the forest..

It is evident that the tree-based regressor has no problem fitting the wild variance of the daily queue length (workload) ##TODO## change to queue length? Using the slider to redraw the plot with increasing depth and watch the MAE drop. But are we done? Can this model really predict future workload? Since we did random sampling of our data we would expect the fitted model to predict well on the test sets, if we have enough data points in each set. It

Time-Based Split

To test our Random Forest model further, we can create a time based split. The datetime index is sorted, we will re-train the model using the first 80% of the observations along the time axis, with the remaining 20% for test. Use the training set% slider to observe the effect on the prediction. If you think that's bogus, use the dropdown, and watch what happens when you change the split as follows: out of the original train set, create a random 80% sample and use the remainder for validation. prepare to be amazed...

Linear Regression in detail:

lets see what happens if we just treat the workload to a linear regression, with a random split:

In [77]:
#use a time based split 
from ipywidgets import FloatSlider

def f(md,ts,tvt_split='OFF'):
    ## time axis split
    trainsize = int(df_dl.shape[0]*ts)  #initial split of 80/20 ffrom the slider default
    
    model = RandomForestRegressor(max_depth=md)
    
    train = df_dl.copy().iloc[0:trainsize]
    test =  df_dl.copy().iloc[trainsize:] 
    
    if tvt_split == 'ON' :                                           #random sample the first portion into train,val then we can test 
        train,val = train_test_split(train, test_size=.2)     
        X_val = val.day_offset.values.reshape(-1, 1) 
        y_val = val.workload.values
    
    X_train = train.day_offset.values.reshape(-1, 1) 
    y_train = train.workload.values
    X_test = test.day_offset.values.reshape(-1, 1) 
    y_test = test.workload.values
    
    model.fit(X_train, y_train)
    
    fig,ax = plt.subplots()
    ax.spines['left'].set_visible(True)
    ax.spines['bottom'].set_visible(True)
    if tvt_split == 'ON':             
        y_val_pred = model.predict(X_val)
        plt.figtext(.145,.3,
                    f"val mae:{mean_absolute_error(y_val_pred,y_val):.2f}", 
                    ha='left',fontsize=15,color='green')   
     
        val['Val pred'] = y_val_pred     #add the predictions for val to df
        
        val.sort_index(inplace=True)
        val.plot( y= 'workload',  figsize=(30,6), color='red',ax=ax) #val actual
        val.plot( y='Val pred',  figsize=(30,6), color='green', ax=ax)#val predicted 
        ax.legend (['val actual', 'val predicted'])

    y_pred = model.predict(X_test)   
    
    plt.figtext(.5,.9,'RF model predicted values vs. actual,\n as function of split and max_depth',fontsize=20,ha='center')

    plt.figtext(.75,.3,f"test mae: {mean_absolute_error(y_pred,y_test):.2f}", fontsize=15, color='blue')
    
    test['Test predicted']= y_pred
    
    test.plot(y='workload', figsize=(30,6), color='maroon', ax=ax) #actual workload
    test.plot(y='Test predicted', figsize=(30,6), color='blue', ax=ax)#predicted on TEST

ts =  FloatSlider(description='training set%',min=.50, max=.9, step=.05 , value=.8)    
md =  IntSlider(description='max_depth',min=1,max=10, step=1,  value=8)


tvt_split = widgets.Dropdown(
    options=['OFF','ON'],
    value='OFF',
    description='3-way SPLIT',
    disabled=False,
    orientation='horizontal'
)
interact(f, md=md, ts=ts, tvt_split=tvt_split)
Out[77]:
<function __main__.f(md, ts, tvt_split='OFF')>

Wow this tree model is very smart!

This is what happens when you train your model on a time series with time based split! It appears to have determined that its not worth the effort and just guess a constant! but this

Where do we go from here? Lets do some window-looking.

Now that we have run ican see the deficiencies of regression models, we need to introduce some means to better interpret the data. If there appears to be a periodic fluctuation, we can use a rolling average, or moving mean to smooth out short term variance and visualize larger trends. Below I introduce another calculated feature of the data, the time to resolution or ttr for a specific case. This is very simple; it is the time difference between the open and close for that case. In the data its expressed as a timedelta object, and cast to various datatypes, such as *

In [90]:
matplotlib.style.use('classic')

closed_ttr_day = wrangler.working.copy()


closed_ttr_day['ttr_sec'] = closed_ttr_day['ttr']  #keep seconds
closed_ttr_day['ttr'] = closed_ttr_day['ttr'].dt.total_seconds()/3600   #convert to hours

closed_ttr_day['date_close'] = pd.DatetimeIndex(closed_ttr_day.Closed).date

closed_ttr_day = closed_ttr_day.groupby('date_close')[['CaseID','workload','ttr']].agg({'CaseID': 'count', 'workload': np.mean, 'ttr': 'mean'})
closed_ttr_day = closed_ttr_day.groupby('date_close')[['CaseID','workload','ttr']].agg({'CaseID': 'count', 'workload': 'max', 'ttr': 'mean'})

closed_ttr_day.columns=['case_count','daily_workload', 'avg_ttr_hours']


years = mdates.YearLocator()   # every year
months = mdates.MonthLocator()  # every month
years_fmt = mdates.DateFormatter('%Y')


ax= closed_ttr_day.daily_workload.plot(figsize=(25,12))

closed_ttr_day.daily_workload.rolling(window=10).mean().plot(ax=ax)
closed_ttr_day.daily_workload.rolling(window=30).mean().plot(ax=ax)
closed_ttr_day.daily_workload.rolling(window=360).mean().plot(ax=ax)

closed_ttr_day.avg_ttr_hours.rolling(window=120).mean().plot(ax=ax)      #


ax.set_title('Rolling Averages: \ndaily workload w /rolling avg windows \n time to resolution(ttr) superimposed')

ax.xaxis.grid(True, which='major')

ax.set_ylim(0,600)
ax.xaxis.set_major_locator(years)
ax.xaxis.set_minor_locator(months)
ax.xaxis.set_major_formatter(years_fmt)


ax.legend(['Daily Workload', 'm=10', 'm=30', 'm=360', 'ttr_hrs'])
plt.show()

Rolling Means for Daily Workload

To see through some of the daily variance, and try to look for seasonal effects we can calculate the rolling means with various window sizes. after doing so we can see that there is a slight upward linear trend in the daily workload but with the extreme smoothing using a 360 day window, the red line, there also appears to be a multi-year cycle, perhaps releated to staffing and or process changes, that reduce the workload for a time before the upward trend continues.

In contrast, the mean time to resolution (ttr) in hours shows extreme fluctuations, but perhaps without a clear upward trend. We'll look at this more later

In [79]:
matplotlib.style.use('ggplot')

ax= closed_ttr_day.daily_workload.plot(figsize=(18,5))
ax.set_title('daily workload & Variance over rolling w')

closed_ttr_day.daily_workload.rolling(window=10).var().plot(ax=ax)
closed_ttr_day.daily_workload.rolling(window=30).var().plot(ax=ax)
closed_ttr_day.daily_workload.rolling(window=50).var().plot(ax=ax)
closed_ttr_day.avg_ttr_hours.rolling(window=50).var().plot(ax=ax)

years = mdates.YearLocator()   # every year
months = mdates.MonthLocator()  # every month
years_fmt = mdates.DateFormatter('%Y')


ax.legend(['Daily Workload', 'm=10', 'm=20', 'm=30','mean_ttr variation'])
ax.xaxis.grid(True, which='major')

ax.set_ylim(0,1500)
ax.xaxis.set_major_locator(years)
ax.xaxis.set_minor_locator(months)
ax.xaxis.set_major_formatter(years_fmt)

plt.show()
In [80]:
from sklearn.linear_model import LinearRegression
from statsmodels.graphics.tsaplots import plot_acf

x = closed_ttr_day.avg_ttr_hours
plot_acf(np.array(x),lags=10, title='mean time to resolution hrs, ACF')

#plot_acf(np.array(closed_ttr_day.case_count),lags=20, title='case counts')  #check case count (open/close count) autocorrelation
Out[80]:
In [81]:
from statsmodels.tsa.stattools import adfuller
result = adfuller(daily_load.iloc[:,0].values)
print(f'p value is {result[1]}')
p value is 0.41090577512001386

Since the p value islarger than the significance level of 0.05, we can reject the null hypothesis that the time series data is non-stationary. Thus, the time series data is stationary.

In [ ]:
 
In [82]:
result =closed_ttr_day.iloc[:,2].values    #column is mean time to resolution in hours

print(f'p value is {result[1]}')  # this series is NOT stationary
p value is 21.144166666666667
In [83]:
mean_ttr_by_cat = wrangler.working.groupby(['Category','Request Type']).ttr.apply(lambda x : np.mean(x))
#mean_ttr_by_cat.sort_values(ascending=False).head(5).index.tolist()
mean_ttr_by_cat.head(90)
Out[83]:
Category                    Request Type                                      
311 External Request        Damaged_Property                                      22 days 05:49:23.230769
                            Garbage                                              152 days 17:17:50.750000
                            Graffiti                                              59 days 17:47:51.275590
                            Human/Animal Waste                                            0 days 09:56:30
                            Illegal_Posting                                        6 days 16:38:17.500000
                            Other                                                  3 days 03:11:18.111111
                            Sewer_Water_Storm_Conditions                                  0 days 01:07:34
                            Sidewalk_or_Curb_Issues                               82 days 23:08:25.888888
                            Utility Lines/Wires                                  176 days 15:22:11.750000
                            Value5                                                        0 days 00:36:39
Abandoned Vehicle           Abandoned Vehicle -                                    5 days 06:25:14.260869
                            Abandoned Vehicle - Car2door                           5 days 01:01:20.125000
                            Abandoned Vehicle - Car4door                           5 days 01:44:22.452380
                            Abandoned Vehicle - DeliveryTruck                             4 days 22:35:36
                            Abandoned Vehicle - Motorcycle                         4 days 23:16:52.512820
                            Abandoned Vehicle - Other                              5 days 00:27:46.231707
                            Abandoned Vehicle - PickupTruck                        5 days 15:03:24.440217
                            Abandoned Vehicle - SUV                                4 days 23:49:26.029411
                            Abandoned Vehicle - Trailer                            8 days 11:14:50.588235
                            Abandoned Vehicles                                     4 days 10:35:59.627692
                            SSP Abandoned Vehicles                                 3 days 19:19:46.500000
Blocked Street or SideWalk  Blocked_Parking_Space_or_Strip                        41 days 06:06:05.009615
                            Blocked_Sidewalk                                      63 days 17:31:38.783919
Damaged Property            Damaged Benches_on_Sidewalk                          210 days 00:42:46.352941
                            Damaged Bike_Rack                                     14 days 16:20:10.142857
                            Damaged Fire_Police_Callbox                           47 days 13:44:53.545454
                            Damaged Kiosk_Public_Toilet                          210 days 10:47:42.500000
                            Damaged News_Rack                                     63 days 19:27:11.296296
                            Damaged Parking_Meter                                 29 days 19:02:16.024793
                            Damaged Traffic_Signal                               179 days 11:13:51.232142
                            Damaged Transit_Shelter_Ad_Kiosk                       0 days 21:10:05.416666
                            Damaged Transit_Shelter_Platform                      26 days 09:16:23.854166
                            Damaged Transit_Shelter_Platform_Hazardous            17 days 14:17:47.700000
                            Damaged other                                         64 days 04:30:36.750000
Encampments                 Encampment Reports                                     9 days 09:34:36.290220
                            Encampment items                                       4 days 01:57:16.049645
Graffiti                    Graffiti on ATT_Property                              17 days 05:22:59.238805
                            Graffiti on Bike_rack                                        82 days 05:59:04
                            Graffiti on Bridge                                   103 days 01:44:55.250000
                            Graffiti on Building_commercial                       38 days 01:46:18.186602
                            Graffiti on Building_other                           233 days 16:38:11.049904
                            Graffiti on Building_residential                      32 days 01:46:40.841317
                            Graffiti on City_receptacle                           14 days 07:21:46.980000
                            Graffiti on Fire_Police_Callbox                       69 days 22:51:28.395061
                            Graffiti on Fire_call_box                             23 days 02:15:08.125000
                            Graffiti on Fire_hydrant                              41 days 11:13:08.983333
                            Graffiti on Fire_hydrant_puc                                 30 days 03:11:00
                            Graffiti on Mail_box                                 178 days 07:11:02.402985
                            Graffiti on News_rack                                 52 days 22:34:16.893442
                            Graffiti on Other                                      6 days 04:14:41.105263
                            Graffiti on Other_enter_additional_details_below      39 days 03:42:13.963320
                            Graffiti on Other_for_Parks_ONLY                       5 days 05:39:50.333333
                            Graffiti on Parking_meter                             34 days 23:13:01.350000
                            Graffiti on Pay_phone                                315 days 18:22:16.800000
                            Graffiti on Pole                                      22 days 00:22:35.724543
                            Graffiti on Private Property                          44 days 02:36:16.181818
                            Graffiti on Public Property                                   0 days 00:33:23
                            Graffiti on Sidewalk_in_front_of_property             35 days 06:58:37.156342
                            Graffiti on Sidewalk_structure                        26 days 05:41:34.362694
                            Graffiti on Sign                                     115 days 21:22:17.610169
                            Graffiti on Signal_box                                15 days 03:38:18.579487
                            Graffiti on Street                                    39 days 09:00:18.452631
                            Graffiti on Transit_Shelter_Platform                  34 days 23:05:53.637168
Homeless Concerns           Individual Concerns                                    5 days 13:17:24.202749
Illegal Postings            Illegal Posting - Affixed_Improperly                  75 days 22:11:29.625000
                            Illegal Posting - Multiple_Postings                          16 days 19:52:09
                            Illegal Posting - No_Posting_Date                      5 days 06:49:28.981481
                            Illegal Posting - Posted_Over_70_Days                  9 days 02:07:40.333333
                            Illegal Posting - Posted_on_Directional_Sign                  0 days 16:51:51
                            Illegal Posting - Posted_on_Traffic_Light              6 days 05:12:05.387096
                            Illegal Posting - Posting_Too_High_on_Pole                    1 days 01:03:45
                            Illegal Postings - Affixed_Improperly                 33 days 07:10:58.814814
                            Illegal Postings - Multiple_Postings                  27 days 13:24:58.057142
                            Illegal Postings - No_Posting_Date                     4 days 11:07:49.375000
                            Illegal Postings - Posted_Over_70_Days                11 days 14:26:31.033333
                            Illegal Postings - Posted_on_Directional_Sign          1 days 18:05:32.800000
                            Illegal Postings - Posted_on_Historic_Street_Light            2 days 03:12:32
                            Illegal Postings - Posted_on_Traffic_Light            55 days 19:28:55.790697
                            Illegal Postings - Posting_Too_High_on_Pole                  28 days 02:48:01
                            Illegal Postings - Posting_Too_Large_in_Size           2 days 09:30:54.250000
Litter Receptacles          Add_remove_garbage_can                                77 days 01:59:57.339622
                            Cans_Left_Out_24x7                                    49 days 19:52:47.262500
                            City_Can_Other                                        13 days 06:55:17.062500
                            City_Can_Removed                                              8 days 07:27:49
                            Damaged_City_Can                                      16 days 04:57:57.614285
                            Debris_Box                                                  597 days 08:14:22
                            Debris_box_maintenance_overflowing                     3 days 19:52:44.500000
                            Door_Lock_issues                                       1 days 18:43:43.750000
                            Door_lock_issue                                       16 days 01:29:28.437500
                            Liner_issue_damaged_missing                                   2 days 20:34:45
Name: ttr, dtype: timedelta64[ns]
In [84]:
## for mapping by category 
def f(stepin):
    fig, ax = plt.subplots()
    needle = '(needle)|(syringe)|(sharp)'

    sharps = wrangler.working['Request Details'].str.contains(needle, case=False)
    med_waste = wrangler.working['Request Type'] == 'Medical Waste'
     
    poop2 = wrangler.working['Request Details'].str.contains('Human', case=False)
    camp  = wrangler.working['Request Details'].str.contains('Encampment', case=False)
    wrangler.working[sharps]
    wrangler.working[med_waste]

    zlt = wrangler.working.Latitude == 0
    zln = wrangler.working.Longitude == 0
    zl =  (zln | zlt)

  

    sdic = {'poop': (poop2 & ~zl), 'needles': med_waste, 'people': camp}  #selector label: boolfilter

    #ax1 = wrangler.working[poop2 & ~zl].plot(x='Longitude',y='Latitude', color='red', kind ='scatter',figsize=(10,10), title = 'SF poopmap', alpha=.5)
    #ax2 = wrangler.working[med_waste & ~zl].plot(x='Longitude',y='Latitude', color='orange', kind ='scatter',figsize=(10,10), title = 'SF needles', alpha=.5)


    ax = wrangler.working[sdic[stepin]].plot(x='Longitude',y='Latitude', color='orange', kind ='scatter',figsize=(10,10), title = 'Watch your Step.. ', alpha=.5)
    fig.show()
    
soptions= ['poop','needles','people']
stepin = widgets.SelectionSlider(
    options=soptions,
    value='poop',
    description='I like to step on...',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True
)
In [85]:
## view some neighborhood stats
#wrangler.working.groupby(by=['case_year','Neighborhood','Request Type'])[['CaseID','ttr']].agg({'CaseID': 'count','ttr': lambda x: x.mean()})
ttr_by_neighborhood  = wrangler.working.groupby(by=['Neighborhood','Request Details'])[['CaseID','ttr']].agg({'CaseID': 'count','ttr': lambda x: x.mean()})
poop2 = ttr_by_neighborhood.index.get_level_values('Request Details').str.contains('Human', case=False)
med_waste = ttr_by_neighborhood.index.get_level_values('Request Details') == 'Medical Waste'

ttr_by_neighborhood[poop2]
Out[85]:
CaseID ttr
Neighborhood Request Details
Alamo Square Human or Animal Waste 7 0 days 10:39:00.142857
Apparel City Human or Animal Waste 4 0 days 03:51:19.250000
Ashbury Heights Human or Animal Waste 1 2 days 21:04:00
Bayview Human or Animal Waste 15 2 days 07:26:19.200000
Bernal Heights Human or Animal Waste 8 0 days 21:51:12.375000
Bret Harte Human or Animal Waste 7 2 days 15:28:54.428571
Buena Vista Human or Animal Waste 4 1 days 00:59:01.500000
Castro Human or Animal Waste 15 1 days 06:59:20.800000
Cathedral Hill Human or Animal Waste 26 1 days 12:31:10.538461
Cayuga Human or Animal Waste 4 3 days 05:36:54.750000
Central Waterfront Human or Animal Waste 5 5 days 11:02:18.200000
Chinatown Human or Animal Waste 26 105 days 09:08:40.576923
Civic Center Human or Animal Waste 58 3 days 12:56:02.086206
Cole Valley Human or Animal Waste 4 1 days 01:10:10.500000
Corona Heights Human or Animal Waste 2 0 days 22:01:11
Crocker Amazon Human or Animal Waste 1 1 days 16:55:00
Dogpatch Human or Animal Waste 2 5 days 22:59:01
Dolores Heights Human or Animal Waste 14 1 days 07:56:28.714285
Downtown / Union Square Human or Animal Waste 12 2 days 05:32:20.416666
Duboce Triangle Human or Animal Waste 10 35 days 02:07:01.100000
Eureka Valley Human or Animal Waste 7 1 days 18:41:07.714285
Excelsior Human or Animal Waste 3 0 days 14:28:31.666666
Fairmount Human or Animal Waste 1 0 days 01:05:00
Financial District Human or Animal Waste 16 3 days 12:36:55.375000
Fisherman's Wharf Human or Animal Waste 2 34 days 06:31:17
Glen Park Human or Animal Waste 1 1061 days 08:50:09
Golden Gate Heights Human or Animal Waste 1 1 days 01:34:00
Golden Gate Park Human or Animal Waste 2 3 days 07:41:01.500000
Haight Ashbury Human or Animal Waste 16 44 days 04:29:10
Hayes Valley Human or Animal Waste 16 5 days 03:58:29.562500
Holly Park Human or Animal Waste 3 0 days 22:37:00
Hunters Point Human or Animal Waste 4 0 days 10:57:15
India Basin Human or Animal Waste 2 3 days 18:39:39
Ingleside Human or Animal Waste 6 1 days 19:05:51.666666
Inner Richmond Human or Animal Waste 18 2 days 23:13:12.277777
Inner Sunset Human or Animal Waste 3 0 days 12:16:19
Japantown Human or Animal Waste 5 1 days 00:46:09.800000
Lakeshore Human or Animal Waste 1 0 days 17:48:00
Laurel Heights / Jordan Park Human or Animal Waste 4 1 days 18:38:52.250000
Lone Mountain Human or Animal Waste 3 5 days 20:57:00
Lower Haight Human or Animal Waste 22 3 days 03:12:14.681818
Lower Nob Hill Human or Animal Waste 69 6 days 11:48:26.956521
Lower Pacific Heights Human or Animal Waste 5 2 days 00:53:36.400000
Marina Human or Animal Waste 9 1 days 15:02:09.444444
Merced Heights Human or Animal Waste 1 0 days 17:03:20
Mint Hill Human or Animal Waste 7 2 days 14:54:22.428571
Mission Human or Animal Waste 207 3 days 04:59:57.543689
Mission Bay Human or Animal Waste 8 1 days 19:50:40.375000
Mission Dolores Human or Animal Waste 57 4 days 14:04:03.736842
Mission Terrace Human or Animal Waste 5 1 days 16:00:51.800000
Mt. Davidson Manor Human or Animal Waste 1 1 days 02:22:43
Nob Hill Human or Animal Waste 18 0 days 23:38:52.277777
Noe Valley Human or Animal Waste 3 1 days 08:28:00
North Beach Human or Animal Waste 29 20 days 04:42:02.931034
Northern Waterfront Human or Animal Waste 2 0 days 01:21:02.500000
Oceanview Human or Animal Waste 2 1 days 23:49:53
Outer Mission Human or Animal Waste 1 0 days 02:09:00
Outer Richmond Human or Animal Waste 13 1 days 20:47:46.307692
Outer Sunset Human or Animal Waste 14 3 days 05:56:35.857142
Pacific Heights Human or Animal Waste 15 1 days 11:47:34.800000
Panhandle Human or Animal Waste 6 3 days 09:55:39
Parkside Human or Animal Waste 1 0 days 00:33:00
Peralta Heights Human or Animal Waste 1 1 days 02:41:00
Polk Gulch Human or Animal Waste 19 3 days 04:10:56.578947
Portola Human or Animal Waste 4 4 days 01:28:24
Potrero Hill Human or Animal Waste 29 1 days 18:29:52.965517
Presidio Terrace Human or Animal Waste 2 4 days 21:30:12
Produce Market Human or Animal Waste 4 4 days 10:11:05.750000
Rincon Hill Human or Animal Waste 6 2 days 03:13:58.833333
Russian Hill Human or Animal Waste 8 8 days 21:38:34.125000
Showplace Square Human or Animal Waste 25 5 days 01:06:10.040000
Silver Terrace Human or Animal Waste 1 883 days 02:49:40
South Beach Human or Animal Waste 9 3 days 00:37:45.888888
South of Market Human or Animal Waste 309 11 days 11:59:39.754045
Human/Animal Waste - BART 1 0 days 00:05:00
St. Mary's Park Human or Animal Waste 1 0 days 05:05:37
Stonestown Human or Animal Waste 1 0 days 09:14:08
Sunnyside Human or Animal Waste 4 4 days 13:42:49.500000
Sutro Heights Human or Animal Waste 2 2 days 07:42:50
Telegraph Hill Human or Animal Waste 4 11 days 20:27:32.250000
Tenderloin Human or Animal Waste 328 7 days 15:48:14.289634
Union Street Human or Animal Waste 2 1 days 04:11:36.500000
Upper Market Human or Animal Waste 1 15 days 03:03:00
Visitacion Valley Human or Animal Waste 4 6 days 19:19:52.500000
West Portal Human or Animal Waste 4 2 days 21:42:55.500000
Western Addition Human or Animal Waste 11 28 days 11:17:45.454545
Westwood Park Human or Animal Waste 2 1 days 21:34:30
missing Human or Animal Waste 19 16 days 09:03:37.631578
Human/Animal Waste - BART 1 0 days 19:48:00
human_resources - complaint 1 27 days 22:49:58
human_resources - customer_callback 2 151 days 12:14:19.500000
human_resources - followup_request 1 399 days 21:25:46
human_resources - mailing_request 1 244 days 07:02:00
human_resources - request_for_service 2 1053 days 07:50:11
In [ ]:
 
In [88]:
df['Neighborhoods'].value_counts(normalize=True,dropna=False)
fig, ax = plt.subplots(figsize=(20,5),dpi=144)
ax = sns.countplot(x='Neighborhoods',data=wrangler.working)
plt.show()
In [ ]: